Enlarging Scarce In-domain English-Croatian Corpus for SMT of MOOCs Using Serbian

نویسندگان

  • Maja Popovic
  • Kostadin Cholakov
  • Valia Kordoni
  • Nikola Ljubesic
چکیده

Massive Open Online Courses have been growing rapidly in size and impact. Yet the language barrier constitutes a major growth impediment in reaching out all people and educating all citizens. A vast majority of educational material is available only in English, and state-of-the-art machine translation systems still have not been tailored for this peculiar genre. In addition, a mere collection of appropriate in-domain training material is a challenging task. In this work, we investigate statistical machine translation of lecture subtitles from English into Croatian, which is morphologically rich and generally weakly supported, especially for the educational domain. We show that results comparable with publicly available systems trained on much larger data can be achieved if a small in-domain training set is used in combination with additional in-domain corpus originating from the closely related Serbian language.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Exploring cross-language statistical machine translation for closely related South Slavic languages

This work investigates the use of crosslanguage resources for statistical machine translation (SMT) between English and two closely related South Slavic languages, namely Croatian and Serbian. The goal is to explore the effects of translating from and into one language using an SMT system trained on another. For translation into English, a loss due to cross-translation is about 13% of BLEU and ...

متن کامل

Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor's Love Affair

This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool fo...

متن کامل

The SETimes.HR Linguistically Annotated Corpus of Croatian

We present SETIMES.HR— the first linguistically annotated corpus of Croatian that is freely available for all purposes. The corpus is built on top of the SETIMES parallel corpus of nine Southeast European languages and English. It is manually annotated for lemmas, morphosyntactic tags, named entities and dependency syntax. We couple the corpus with domain-sensitive test sets for Croatian and Se...

متن کامل

Lemmatization and Morphosyntactic Tagging of Croatian and Serbian

We investigate state-of-the-art statistical models for lemmatization and morphosyntactic tagging of Croatian and Serbian. The models stem from a new manually annotated SETIMES.HR corpus of Croatian, based on the SETimes parallel corpus. We train models on Croatian text and evaluate them on samples of Croatian and Serbian from the SETimes corpus and the two Wikipedias. Lemmatization accuracy for...

متن کامل

Dealing with Data Sparseness in SMT with Factured Models and Morphological Expansion: a Case Study on Croatian

This paper describes our experience using available linguistic resources for Croatian in order to address data sparseness when building an English-to-Croatian general domain phrasebased statistical machine translation system. We report the results obtained with factored translation models and morphological expansion, highlight the impact of the algorithm used for tagging the corpora, and show t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016